ABSTRACT

This paper outlines a prototype text retrieval system which uses
relatively advanced natural language processing techniques in order to
enhance the effectiveness of statistical document retrieval. The
backbone of our system is a traditional retrieval engine which builds
inverted index files from pre-processed documents, and then searches
and ranks the documents in response to user queries. Natural language
processing is used to (1) preprocess the documents in order to extract
content-carrying terms, (2) discover inter-term dependencies and build
a conceptual hierarchy specific to the database domain, and (3) process
the user's natural language requests into effective search queries. The
basic assumption of this design is that a term-based representation of
contents is in principle sufficient to build an effective, if not
optimal, search query out of any user's request. This has been
confirmed by an experiment that compared the effectiveness of queries
prepared by expert users with those derived automatically from an
initial narrative information request. In this paper we show that
large-scale natural language processing (hundreds of millions of words
and more) is not only required for better retrieval, but is also
doable, given appropriate resources. We report on selected preliminary
results of experiments with a 500 MByte database of Wall Street Journal
articles, as well as some earlier results with a smaller document
collection.


INTRODUCTION 

A typical information retrieval (IR) task is to select documents from a
database in response to a user's query, and rank these documents
according to relevance. This has usually been accomplished using
statistical methods (often coupled with manual encoding) that (a)
select terms (words, phrases, and other units) from documents that are
deemed to best represent their contents, and (b) create an inverted
index file (or files) that provides easy access to documents containing
these terms. An important issue here is that of finding an appropriate
combination of term weights which would reflect each term's relative
contribution to the information contents of the document. Among many
possible weighting schemes the inverse document frequency (idf) has
come to be recognized as universally applicable across a variety of
different text collections.
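
As an illustration of the idf weighting mentioned above, the following minimal sketch computes idf over a toy collection. The exact formula is not given in this paper; the common form idf(t) = log(N / df(t)) is assumed here, and the function name and the four tiny documents are invented for the example.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Return idf(t) = log(N / df(t)) for every term of a tokenized collection.

    The log(N / df) form is an assumption of this sketch.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Invented toy collection: "joint" and "venture" are frequent, "capital" is rare.
docs = [
    ["joint", "venture", "announced"],
    ["joint", "statement", "issued"],
    ["venture", "capital", "fund"],
    ["joint", "venture", "dissolved"],
]
weights = idf_weights(docs)
```

Terms that occur in most documents receive weights close to zero, which is the effect described below for "joint" and "venture" in the WSJ database.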

Once the index is created, the search process will attempt to match a
preprocessed user query (or queries) against representations of
documents, in each case determining a degree of relevance between the
two which depends upon the number and types of matching terms. Although
many sophisticated search and matching methods are available, the
crucial problem remains that of an adequate representation of contents
for both the documents and the queries.

The simplest word-based representations of contents are usually
inadequate since single words are rarely specific enough for accurate
discrimination, and their grouping is often accidental. A better method
is to identify groups of words that create meaningful phrases,
especially if these phrases denote important concepts in the database
domain. For example, joint venture is an important term in the Wall
Street Journal (WSJ henceforth) database, while neither joint nor
venture is important by itself. In the retrieval experiments with the
WSJ database, we noticed that both joint and venture were dropped from
the list of terms by the system because their idf weights were too low.
In large databases, such as TIPSTER/TREC, the use of phrasal terms is
not just desirable, it becomes necessary.

The question thus becomes, how to identify the correct phrases in the
text? Both statistical and syntactic methods were used before with only
limited success. Statistical methods based on word co-occurrences and
mutual information are prone to high error rates, turning out many
unwanted associations. Syntactic methods suffered from the low quality
of generated parse structures, which could be attributed to
limited-coverage grammars and the lack of adequate lexicons. In fact,
the difficulties encountered in applying computational linguistics
technologies to text processing have contributed to a widespread belief
that automated natural language processing may not be suitable in IR.
These difficulties included inefficiency, lack of robustness, and the
prohibitive cost of manual effort required to build lexicons and
knowledge bases for each new text domain. On the other hand, while
numerous experiments did not establish the usefulness of linguistic
methods in IR, they cannot be considered conclusive because of their
limited scale.[1]

The rapid progress in Computational Linguistics over the last few years
has changed this equation in various ways. First of all, large-scale
resources became available: on-line lexicons, including the Oxford
Advanced Learner's Dictionary (OALD), Longman Dictionary of
Contemporary English (LDOCE), Webster's Dictionary, Oxford English
Dictionary, Collins Dictionary, and others, as well as large text
corpora, many of which can now be obtained for research purposes.
Robust text-oriented software tools have been built, including part of
speech taggers (stochastic and otherwise), and fast parsers capable of
processing text at speeds of 4200 words per minute or more (e.g., the
TTP parser developed by the author). While many of the fast parsers are
not very accurate (they are usually partial analyzers by design),[2]
some, like TTP, perform in fact no worse than standard full-analysis
parsers which are many times slower and far less robust.[3]

An accurate syntactic analysis is an essential prerequisite for term
selection, but it is by no means sufficient. Syntactic parsing of the
database contents is usually attempted in order to extract
linguistically motivated phrases, which presumably are better
indicators of contents than "statistical phrases" where words are
grouped solely on the basis of physical proximity (e.g., "college
junior" is not the same as "junior college"). However, creation of such
compound terms makes the term matching process more complex since, in
addition to the usual problems of synonymy and subsumption, one must
deal with their structure (e.g., "college junior" is the same as
"junior in college"). In order to deal with structure, the parser's
output needs to be "normalized" or "regularized" so that complex terms
with the same or closely related meanings would indeed receive matching
representations. This goal has been achieved to a certain extent in the
present work. As will be discussed in more detail below, indexing terms
were selected from among head-modifier pairs extracted from
predicate-argument representations of sentences.

[1] Standard IR benchmark collections are statistically too small and
the experiments can easily produce counterintuitive results. For
example, the Cranfield collection is only approx. 180,000 English
words, while the CACM-3204 collection is approx. 200,000 words.

[2] Partial parsing is usually fast enough, but it also generates noisy
data: as many as 30% of all generated phrases could be incorrect (Lewis
and Croft, 1990).

[3] TTP has been shown to produce parse structures which are no worse
in recall, precision and crossing rate than those generated by
full-scale linguistic parsers when compared to hand-coded Treebank
parse trees.

The next important task is to achieve normalization across different
terms with close or related meaning. This can be accomplished by
discovering various semantic relationships among words and phrases,
such as synonymy and subsumption. For example, the term natural
language can be considered, in certain domains at least, to subsume any
term denoting a specific human language, such as English. Therefore, a
query containing the former may be expected to retrieve documents
containing the latter. The system presented here computes term
associations from text on the word and fixed-phrase level and then uses
these associations in query expansion. A fairly primitive filter is
employed to separate synonymy and subsumption relationships from
others, including antonymy and complementation, some of which are
strongly domain-dependent. This process has led to increased retrieval
precision in experiments with smaller and more cohesive collections
(CACM-3204).

In the following sections we present an overview of our system, with
the emphasis on its text-processing components. We would like to point
out here that the system is completely automated, i.e., all the
processing steps, those performed by the statistical core and those
performed by the natural language processing components, are done
automatically, and no human intervention or manual encoding is
required.


OVERALL  DESIGN 

Our information retrieval system consists of a traditional statistical
backbone (NIST's PRISE system; Harman and Candela, 1989) augmented with
various natural language processing components that assist the system
in database processing (stemming, indexing, word and phrase clustering,
selectional restrictions), and translate a user's information request
into an effective query. This design is a careful compromise between
purely statistical non-linguistic approaches and those requiring rather
accomplished (and expensive) semantic analysis of data, often referred
to as 'conceptual retrieval'.

In our system the database text is first processed with a fast
syntactic parser. Subsequently certain types of phrases are extracted
from the parse trees and used as compound indexing terms in addition to
single-word terms. The extracted phrases are statistically analyzed as
syntactic contexts in order to discover a variety of similarity links
between smaller subphrases and words occurring in them. A further
filtering process maps these similarity links onto semantic relations
(generalization, specialization, synonymy, etc.) after which they are
used to transform the user's request into a search query.

The user's natural language request is also parsed, and all indexing
terms occurring in it are identified. Certain highly ambiguous, usually
single-word terms may be dropped, provided that they also occur as
elements in some compound terms. At the same time, other terms may be
added, namely those which are linked to some query term through
admissible similarity relations. For example, "unlawful activity" is
added to a query containing the compound term "illegal activity" via a
synonymy link between "illegal" and "unlawful". After the final query
is constructed, the database search follows, and a ranked list of
documents is returned.
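
The expansion step in the example above can be sketched as follows. This is only an illustration under stated assumptions: compound terms are written in the head+modifier notation used later in the paper, the synonymy links are given as a plain mapping, and the function name `expand_query` is hypothetical rather than part of the system.

```python
def expand_query(terms, synonyms):
    """Add compound terms obtained by swapping one element for a linked word.

    terms    -- list of query terms, compounds written as "head+modifier"
    synonyms -- mapping from a word to words linked to it by an admissible
                similarity relation (assumed to be given)
    """
    expanded = list(terms)
    for term in terms:
        parts = term.split("+")
        for i, word in enumerate(parts):
            for alt in synonyms.get(word, ()):
                candidate = "+".join(parts[:i] + [alt] + parts[i + 1:])
                if candidate not in expanded:
                    expanded.append(candidate)
    return expanded

links = {"illegal": ["unlawful"]}
query = expand_query(["activity+illegal"], links)
```

Here `query` contains both the original compound and the variant built through the "illegal"/"unlawful" link; dropping ambiguous single-word terms is left out of the sketch.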

The purpose of this elaborate linguistic processing is to create a
better representation of documents and to generate the best possible
queries out of users' initial requests. Despite the limitations of
term-and-weight type representation (or boolean versions thereof), very
good queries can be produced by human experts. In order to imitate an
expert, the system must be able to learn about its database, in
particular about various correlations among index terms.


FAST  PARSING  WITH  TTP  PARSER 

TTP (Tagged Text Parser) is based on the Linguistic String Grammar
developed by Sager (1981). The parser currently encompasses some 400
grammar productions, but it is by no means complete. The parser's
output is a regularized parse tree representation of each sentence,
that is, a representation that reflects the sentence's logical
predicate-argument structure. For example, logical subject and logical
object are identified in both passive and active sentences, and noun
phrases are organized around their head elements. The significance of
this representation will be discussed below. The parser is equipped
with a powerful skip-and-fit recovery mechanism that allows it to
operate effectively in the face of ill-formed input or under severe
time pressure. In the runs with approximately 83 million words of
TREC's Wall Street Journal texts,[4] the parser's speed averaged
between 0.3 and 0.5 seconds per sentence, or up to 4200 words per
minute, on a Sun SparcStation-2.

[4] Approximately 0.5 GBytes of text, over 4 million sentences.

TTP is a full grammar parser, and initially it attempts to generate a
complete analysis for each sentence. However, unlike an ordinary
parser, it has a built-in timer which regulates the amount of time
allowed for parsing any one sentence. If a parse is not returned before
the allotted time elapses, the parser enters the skip-and-fit mode in
which it will try to "fit" the parse. While in the skip-and-fit mode,
the parser will attempt to forcibly reduce incomplete constituents,
possibly skipping portions of input in order to restart processing at
the next unattempted constituent. In other words, the parser will favor
reduction over backtracking while in the skip-and-fit mode. The result
of this strategy is an approximate parse, partially fitted using
top-down predictions. The fragments skipped in the first pass are not
thrown out; instead they are analyzed by a simple phrasal parser that
looks for noun phrases and relative clauses and then attaches the
recovered material to the main parse structure. As an illustration,
consider the following sentence taken from the CACM-3204 corpus:

    The method is illustrated by the automatic construction of both
    recursive and iterative programs operating on natural numbers,
    lists, and trees, in order to construct a program satisfying
    certain specifications a theorem induced by those specifications
    is proved, and the desired program is extracted from the proof.

The italicized fragment is likely to cause additional complications in
parsing this lengthy string, and the parser may be better off ignoring
this fragment altogether. To do so successfully, the parser must close
the currently open constituent (i.e., reduce a program satisfying
certain specifications to NP), and possibly a few of its parent
constituents, removing corresponding productions from further
consideration, until an appropriate production is reactivated. In this
case, TTP may force the following reductions: SI → to V NP; SA → SI;
S → NP V NP SA, until the production S → S and S is reached. Next, the
parser skips input to find and, and resumes normal processing.

As may be expected, the skip-and-fit strategy will only be effective if
the input skipping can be performed with a degree of determinism. This
means that most of the lexical-level ambiguity must be removed from the
input text prior to parsing. We achieve this using a stochastic part of
speech tagger to preprocess the text. Full details of the parser can be
found in (Strzalkowski, 1992).


PART OF SPEECH TAGGER

One way of dealing with lexical ambiguity is to use a tagger to
preprocess the input, marking each word with a tag that indicates its
syntactic categorization: a part of speech with selected morphological
features such as number, tense, mode, case and degree. The following
are tagged sentences from the CACM-3204 collection:[5]

    The/dt paper/nn presents/vbz a/dt proposal/nn for/in
    structured/vbn representation/nn of/in multiprogramming/vbg
    in/in a/dt high/jj level/nn language/nn ./per

    The/dt notation/nn used/vbn explicitly/rb associates/vbz a/dt
    data/nns structure/nn shared/vbn by/in concurrent/jj
    processes/nns with/in operations/nns defined/vbn on/in it/pp
    ./per

The tags are understood as follows: dt - determiner, nn - singular
noun, nns - plural noun, in - preposition, jj - adjective, vbz - verb
in present tense third person singular, to - particle "to", vbg -
present participle, vbn - past participle, vbd - past tense verb, vb -
infinitive verb, cc - coordinate conjunction.
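
For readers who want to work with text in this word/tag format, a minimal sketch of reading it into (word, tag) pairs follows. The function name `parse_tagged` is hypothetical; the only assumption is that the tag follows the last slash of each whitespace-separated token.

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so a word containing '/' survives
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = "The/dt paper/nn presents/vbz a/dt proposal/nn ./per"
pairs = parse_tagged(tagged)
```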

Tagging of the input text substantially reduces the search space of a
top-down parser since it resolves most of the lexical-level
ambiguities. In the examples above, tagging of presents as "vbz" in the
first sentence cuts off a potentially long and costly "garden path"
with presents as a plural noun followed by a headless relative clause
starting with (that) a proposal .... In the second sentence, tagging
resolves the ambiguity of used (vbn vs. vbd) and associates (vbz vs.
nns). Perhaps more importantly, elimination of word-level lexical
ambiguity allows the parser to make projections about the input which
is yet to be parsed, using a simple lookahead; in particular, phrase
boundaries can be determined with a degree of confidence (Church,
1988). This latter property is critical for implementing the
skip-and-fit recovery technique outlined in the previous section.

Tagging of input also helps to reduce the number of parse structures
that can be assigned to a sentence, decreases the demand for consulting
the dictionary, and simplifies dealing with unknown words. Since every
item in the sentence is assigned a tag, so are the words for which we
have no entry in the lexicon. Many of these words will be tagged as
"np" (proper noun); however, the surrounding tags may force other
selections. In the following example, Chinese, which does not appear in
the dictionary, is tagged as "jj":[6]

    this/dt paper/nn dates/vbz back/rb the/dt genesis/nn of/in
    binary/jj conception/nn circa/in 5000/cd years/nns ago/rb ,/com
    as/rb derived/vbn by/in the/dt chinese/jj ancients/nns ./per

[5] Tagged using the 35-tag Penn Treebank Tagset created at the
University of Pennsylvania.


WORD  SUFFIX  TRIMMER 

Word stemming has been an effective way of improving document recall
since it reduces words to their common morphological root, thus
allowing more successful matches. On the other hand, stemming tends to
decrease retrieval precision, if care is not taken to prevent
situations where otherwise unrelated words are reduced to the same
stem. In our system we replaced a traditional morphological stemmer
with a conservative dictionary-assisted suffix trimmer.[7] The suffix
trimmer performs essentially two tasks: (1) it reduces inflected word
forms to their root forms as specified in the dictionary, and (2) it
converts nominalized verb forms (e.g., "implementation", "storage") to
the root forms of corresponding verbs (i.e., "implement", "store").
This is accomplished by removing a standard suffix, e.g., "stor+age",
replacing it with a standard root ending ("+e"), and checking the newly
created word against the dictionary, i.e., we check whether the new
root ("store") is indeed a legal word, and whether the original word
("storage") is defined using the new root ("store") or one of its
standard inflectional forms (e.g., "storing"). For example, the
following definitions are excerpted from the Oxford Advanced Learner's
Dictionary (OALD):

    storage n [U] (space used for, money paid for) the storing of
    goods ...

    diversion n [U] diverting ...

    procession n [C] number of persons, vehicles, etc moving forward
    and following each other in an orderly way.

Therefore, we can reduce "diversion" to "divert" by removing the suffix
"+sion" and adding the root form suffix "+t". On the other hand,
"process+ion" is not reduced to "process", since the definition of
"procession" uses neither "process" nor any of its inflectional forms.
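
The dictionary check described above can be sketched in a few lines. This is a simplified illustration, not the system's trimmer: the suffix rules, the crude inflection generator, and the toy dictionary are assumptions made for the example. The noun glosses follow the OALD excerpts quoted above, while the verb glosses are invented placeholders.

```python
# Hypothetical suffix rules for this sketch: (nominal suffix, verb root ending).
SUFFIX_RULES = [("age", "e"), ("sion", "t"), ("ion", "")]

# Toy stand-in for a machine-readable dictionary: word -> definition text.
# The verb glosses are invented placeholders, present only so that the
# candidate roots count as legal words.
DICTIONARY = {
    "storage": "the storing of goods",
    "store": "keep for future use",
    "diversion": "diverting",
    "divert": "turn in another direction",
    "procession": "number of persons, vehicles, etc moving forward",
    "process": "treat in order to preserve",
}

def inflections(root):
    """Very rough inflected forms of a verb root (an assumption of the sketch)."""
    stem = root[:-1] if root.endswith("e") else root
    return {root, root + "s", stem + "ing", stem + "ed"}

def trim(word):
    """Reduce a nominalization to its verb root only if the dictionary agrees."""
    def_words = set(DICTIONARY.get(word, "").replace(",", " ").split())
    for suffix, ending in SUFFIX_RULES:
        if word.endswith(suffix):
            root = word[: -len(suffix)] + ending
            # The root must be a legal word AND appear (possibly inflected)
            # in the definition of the original word.
            if root in DICTIONARY and inflections(root) & def_words:
                return root
    return word
```

With this toy data, "storage" and "diversion" are reduced to their verb roots, while "procession" is left unchanged because its definition never mentions a form of "process".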

Earlier experiments with the CACM-3204 collection showed an improvement
in retrieval precision of 6% to 8% over the base system equipped with a
standard morphological stemmer (the SMART stemmer).

[6] We use the machine-readable version of the Oxford Advanced
Learner's Dictionary (OALD).

[7] Dealing with prefixes is a more complicated matter, since they may
have quite a strong effect upon the meaning of the resulting term,
e.g., un- usually introduces explicit negation.


HEAD-MODIFIER STRUCTURES


Syntactic phrases extracted from TTP parse trees are head-modifier
pairs. The head in such a pair is a central element of a phrase (main
verb, main noun, etc.), while the modifier is one of the adjunct
arguments of the head. In the TREC experiments reported here we
extracted head-modifier word and fixed-phrase pairs only. While the
TREC WSJ database is large enough to warrant generation of larger
compounds, we were in no position to verify their effectiveness in
indexing. This was largely because of the tight schedule, but also
because of the rapidly escalating complexity of the indexing process:
even with 2-word phrases, compound terms accounted for nearly 96% of
all index entries; in other words, including 2-word phrases has
increased the index size 25 times!

Let us consider a specific example from the WSJ database:

    The former Soviet president has been a local hero ever since a
    Russian tank invaded Wisconsin.

The tagged sentence is given below, followed by the regularized parse
structure generated by TTP, given in Figure 1.

    The/dt former/jj Soviet/jj president/nn has/vbz been/vbn a/dt
    local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn
    invaded/vbd Wisconsin/np ./per

    [assert
      [[perf [HAVE]]
        [[verb [BE]]
          [subject
            [np
              [n PRESIDENT]
              [t_pos THE]
              [adj [FORMER]]
              [adj [SOVIET]]]]
          [object
            [np
              [n HERO]
              [t_pos A]
              [adj [LOCAL]]]]
          [adv EVER]
          [sub_ord
            [SINCE
              [[verb [INVADE]]
                [subject
                  [np
                    [n TANK]
                    [t_pos A]
                    [adj [RUSSIAN]]]]
                [object
                  [np
                    [name [WISCONSIN]]]]]]]]]]

Figure 1. Predicate-argument parse structure.

It should be noted that the parser's output is a predicate-argument
structure centered around the main elements of various phrases. In
Figure 1, BE is the main predicate (modified by HAVE) with 2 arguments
(subject, object) and 2 adjuncts (adv, sub_ord). INVADE is the
predicate in the subordinate clause with 2 arguments (subject, object).
The subject of BE is a noun phrase with PRESIDENT as the head element,
two modifiers (FORMER, SOVIET) and a determiner (THE). From this
structure, we extract head-modifier pairs that become candidates for
compound terms. The following types of pairs are considered: (1) a head
noun and its left adjective or noun adjunct, (2) a head noun and the
head of its right adjunct, (3) the main verb of a clause and the head
of its object phrase, and (4) the head of the subject phrase and the
main verb. These types of pairs account for most of the syntactic
variants for relating two words (or simple phrases) into pairs carrying
compatible semantic content. For example, the pair retrieve+information
will be extracted from any of the following fragments: information
retrieval system; retrieval of information from databases; and
information that can be retrieved by a user-controlled interactive
search process. In the example at hand, the following head-modifier
pairs are extracted (pairs containing low-content elements, such as BE
and FORMER, or names, such as WISCONSIN, will be later discarded):

    [PRESIDENT,BE]
    [PRESIDENT,FORMER]
    [PRESIDENT,SOVIET]
    [BE,HERO]
    [HERO,LOCAL]
    [TANK,INVADE]
    [TANK,RUSSIAN]
    [INVADE,WISCONSIN]

We may note that the three-word phrase former Soviet president has been
broken into two pairs, former president and Soviet president, both of
which denote things that are potentially quite different from what the
original phrase refers to, and this fact may have a potentially
negative effect on retrieval precision. This is one place where a
longer phrase appears more appropriate. The representation of this
sentence may therefore contain the following terms:

    PRESIDENT, SOVIET, PRESIDENT+SOVIET, PRESIDENT+FORMER, HERO,
    HERO+LOCAL, INVADE, TANK, TANK+INVADE, TANK+RUSSIAN, RUSSIAN,
    INVADE+WISCONSIN, WISCONSIN.
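
The pair extraction for the example sentence can be sketched as a walk over a simplified predicate-argument structure. The nested-dictionary encoding and the function names below are assumptions made for this illustration; the parser's actual output format is the bracketed structure of Figure 1, and pairs of type (2), a head noun with the head of its right adjunct, do not arise in this sentence and are omitted.

```python
def np_pairs(np):
    """Type (1): a head noun paired with each left adjective or noun adjunct."""
    return [(np["head"], mod) for mod in np.get("adj", [])]

def clause_pairs(clause):
    """Collect head-modifier pairs from a clause and its subordinate clauses."""
    pairs = []
    verb = clause["verb"]
    subj, obj = clause.get("subject"), clause.get("object")
    if subj:
        pairs.append((subj["head"], verb))   # type (4): subject head + main verb
        pairs.extend(np_pairs(subj))
    if obj:
        pairs.append((verb, obj["head"]))    # type (3): main verb + object head
        pairs.extend(np_pairs(obj))
    for sub in clause.get("sub_ord", []):
        pairs.extend(clause_pairs(sub))
    return pairs

# Hand-built encoding of the structure for the example sentence (an assumed
# representation, with determiners and the adverb left out).
parse = {
    "verb": "BE",
    "subject": {"head": "PRESIDENT", "adj": ["FORMER", "SOVIET"]},
    "object": {"head": "HERO", "adj": ["LOCAL"]},
    "sub_ord": [{
        "verb": "INVADE",
        "subject": {"head": "TANK", "adj": ["RUSSIAN"]},
        "object": {"head": "WISCONSIN"},
    }],
}
pairs = clause_pairs(parse)
```

The resulting list reproduces the eight pairs given above for this sentence; filtering out low-content elements and names would be a separate later step.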

The particular way of interpreting syntactic contexts was dictated, to
some degree at least, by statistical considerations. Our original
experiments were performed on a relatively small collection
(CACM-3204), and therefore we combined pairs obtained from different
syntactic relations (e.g., verb-object, subject-verb, noun-adjunct,
etc.) in order to increase the frequencies of some associations. This
became largely unnecessary in a large collection such as TIPSTER, but
we had no means to test alternative options, and thus decided to stay
with the original. It should not be difficult to see that this was a
compromise solution, since many important distinctions were potentially
lost, and strong associations could be produced where there weren't
any. A way to improve things is to consider different syntactic
relations independently, perhaps as independent sources of evidence
that could lend support (or not) to certain term similarity
predictions. We have already started testing this option.

One difficulty in obtaining head-modifier pairs of highest accuracy is
the notorious ambiguity of nominal compounds. For example, the phrase
natural language processing should generate language+natural and
processing+language, while dynamic information processing is expected
to yield processing+dynamic and processing+information. Still another
case is executive vice president, where the association
president+executive may be stretching things a bit too far. Since our
parser has no knowledge about the text domain, and uses no semantic
preferences, it does not attempt to guess any internal associations
within such phrases. Instead, this task is passed to the pair extractor
module which processes ambiguous parse structures in two phases. In
phase one, all and only unambiguous head-modifier pairs are extracted,
and the frequencies of their occurrences are recorded. In phase two,
frequency information about pairs generated in the first pass is used
to form associations from ambiguous structures. For example, if
language+natural has occurred unambiguously a number of times in
contexts such as parser for natural language, while processing+natural
has occurred significantly fewer times or perhaps none at all, then we
will prefer the former association as valid.
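
The second phase can be sketched for three-word compounds as follows. This is a deliberately simplified illustration: it assumes the middle word always attaches to the head, breaks ties in favor of the head, and uses invented frequency counts; the function name `resolve_compound` is hypothetical.

```python
from collections import Counter

def resolve_compound(words, pair_counts):
    """Choose the attachment of the first word in a compound (w1 w2 w3).

    pair_counts maps (head, modifier) to the number of unambiguous
    occurrences recorded in phase one. The pair (w3, w2) is assumed to
    hold in all cases, which is a simplification of this sketch.
    """
    w1, w2, w3 = words
    inner = pair_counts[(w2, w1)]   # evidence that w1 modifies w2
    outer = pair_counts[(w3, w1)]   # evidence that w1 modifies the head w3
    attach_to = w2 if inner > outer else w3
    return [(attach_to, w1), (w3, w2)]

# Invented phase-one counts, for illustration only.
counts = Counter({
    ("language", "natural"): 12,
    ("processing", "information"): 7,
    ("processing", "dynamic"): 3,
})
nlp = resolve_compound(("natural", "language", "processing"), counts)
dip = resolve_compound(("dynamic", "information", "processing"), counts)
```

With these counts, `nlp` yields language+natural and processing+language, while `dip` yields processing+dynamic and processing+information, matching the two expected analyses in the text.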


their  information  contribution  over  contexts  vary 
greatly,  are  the  common  contexts  in  which  these 
terms  occur  specific  enough?  In  general  we  will 
credit  high-contents  terms  appearing  in  identical  con¬ 
texts,  especially  if  these  contexts  are  not  too  com¬ 
monplace.*  The  relative  similarity  between  two 
words  Xi  and  x  2  can  be  obtained  using  the  following 
formula  (a  is  a  large  constant):  ’ 

SIM  (xi  ,j:2)  =  /og  (a  X  1  ,j:2)) 

y 


where 

simy{xi,X2)  =  MIN  (,ICixt,[x  i.y]),IC  (xi.lx-i.v])) 
*  MINdCiy.  U,.y])./CCv'  [jr^'y])) 


and  IC  is  the  Information  Contribution  measure  indi¬ 
cating  the  strength  of  word  pairings,  and  defined  as 


IC(x,[x,y])  = 


f.y 

n^+d^-l 


where  fj,  y  is  the  absolute  frequency  of  pair  [x,y]  in 
the  corpus,  is  the  frequency  of  term  x  at  the  head 
position,  and  dj  is  a  dispersion  parameter  understood 
as  the  number  of  distinct  syntactic  contexts  in  which 
term  x  is  found.  The  similarity  function  is  further 
normalized  with  respect  to  SIMixi,Xi).  Example 
similarities  are  listed  in  Table  1. 

We  also  considered  a  term  clustering  option 
which,  unlike  the  similarity  formula  above,  produces 
clusters  of  related  words  and  phrases,  but  will  not 
generate  uniform  term  similarity  ranking  across  clus¬ 
ters.  We  used  a  variant  of  weighted  Tanimoto's 
measure  described  in  (Grefenstette,  1992); 

JfdIN(Wax,ati]),W([y,ati]) 

gft 

SIM  (X ,  .X2)  =  ^AX{W([x,att]),W(ly,att]) 

an 


with 


TERM  CORRELATIONS  FROM  TEXT 

Head-modifier  pairs  form  compound  terms 
used  in  database  indexing.  They  also  serve  as 
occurrence  contexts  for  smaller  terms,  including 
single-word  terms.  If  two  terms  tend  to  be  modified 
with  a  number  of  common  modifiers  and  otherwise 
appear  in  few  distinct  contexts,  we  assign  them  a 
similarity  coefficient,  a  real  number  between  0  and  1. 
The  similarity  is  determined  by  comparing  distribu¬ 
tion  characteristics  for  both  terms  within  the  corpus: 
how  much  information  contents  do  they  carry,  do 


$$W([x,y]) = GW(x) \cdot \log(f_{x,y})$$

$$GW(x) = 1 - \sum_{y} \frac{\dfrac{f_{x,y}}{n_y} \cdot \log\dfrac{n_y}{f_{x,y}}}{\log(N)}$$
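
The weighted Tanimoto measure itself reduces to a ratio of two sums over attributes, which the following minimal sketch makes concrete. The weights are taken as given inputs (in the paper they would come from W), an attribute missing from one term is assumed to carry weight 0 for that term, and the function name and example weights are hypothetical.

```python
def weighted_tanimoto(w_x, w_y):
    """Sum of MIN weights divided by sum of MAX weights over all attributes.

    w_x and w_y map attribute -> non-negative weight W([term, att]).
    ASSUMPTION: an attribute absent from one term counts as weight 0 there.
    """
    atts = set(w_x) | set(w_y)
    num = sum(min(w_x.get(a, 0.0), w_y.get(a, 0.0)) for a in atts)
    den = sum(max(w_x.get(a, 0.0), w_y.get(a, 0.0)) for a in atts)
    return num / den if den > 0 else 0.0

# Hypothetical attribute weights for two terms.
stock = {"price": 2.0, "market": 1.0}
share = {"price": 1.0, "market": 1.0, "holder": 1.0}
score = weighted_tanimoto(stock, share)   # (1 + 1 + 0) / (2 + 1 + 1)
```

Because the measure is a ratio of per-pair sums, scores are comparable within a cluster but, as noted above, do not give a uniform ranking across clusters.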

⁸ It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.

⁹ This was inspired by a formula used by Hindle (1990).
Sample clusters obtained from an approx. 100 MByte (17 million words) sample of WSJ are given in Table 2.

In order to generate better similarities and clusters, we require that words $x_1$ and $x_2$ appear in at least M distinct common contexts, where a common context is a couple of pairs $[x_1,y]$ and $[x_2,y]$, or $[y,x_1]$ and $[y,x_2]$, such that they each occurred at least twice. Thus, banana and Baltic will not be considered for similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context comparably frequent (there wasn’t any in TREC WSJ database). For smaller or narrow domain databases M=2 is usually sufficient. For large databases covering rather diverse subject matter, like TIPSTER or even WSJ, we used M>3.¹⁰
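
The common-context requirement can be sketched as a simple pre-filter over pair counts. This is only an illustration: the function names are hypothetical and the pair frequencies below are invented so as to mirror the banana, Baltic and Dominican examples; they are not corpus figures.

```python
from collections import Counter

def common_contexts(f, x1, x2, min_pair_freq=2):
    """Contexts shared by x1 and x2, in either position of the pair,
    where both pairs occurred at least min_pair_freq times."""
    shared = set()
    for (a, b), count in f.items():
        if count < min_pair_freq:
            continue
        if a == x1 and f.get((x2, b), 0) >= min_pair_freq:
            shared.add(("right", b))   # pairs [x1, y] and [x2, y]
        if b == x1 and f.get((a, x2), 0) >= min_pair_freq:
            shared.add(("left", a))    # pairs [y, x1] and [y, x2]
    return shared

def eligible(f, x1, x2, m=2):
    """True if x1 and x2 have at least m distinct common contexts."""
    return len(common_contexts(f, x1, x2)) >= m

# Hypothetical counts, keyed as (head, modifier).
f = Counter({
    ("republic", "banana"): 40,
    ("republic", "baltic"): 25,
    ("republic", "dominican"): 30,
    ("plant", "banana"): 3,
    ("plant", "dominican"): 2,
})
```

With these counts banana and baltic share only republic and are rejected at M=2, while banana and dominican share republic and plant, so they pass at M=2 but not at M=3.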

It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990). Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser.¹¹


QUERY  EXPANSION 

Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations).¹² It follows that not all similarity relations will be equally useful in query expansion; for instance, complementary and antonymous relations like the

¹⁰ For example banana and Dominican were found to have two common contexts: republic and plant, although this second occurred in apparently different senses in Dominican plant and banana plant.

¹¹ Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts; see also (Grishman et al., 1986).

¹² Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. An alternative is to use term clusters to create new terms, "metaterms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.


one between Australian and Canadian, or accept and reject may actually harm the system's performance, since we may end up retrieving many irrelevant documents. Similarly, the effectiveness of a query containing vitamin is likely to diminish if we add a similar but far more general term such as acid. On the other hand, database search is likely to miss relevant documents if we overlook the fact that fortran is a programming language, or that infant is a baby and baby is a child. We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our experiments where a relatively crude filter has visibly increased retrieval precision.


In order to create an appropriate filter, we devised a global term specificity measure (GTS) which is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts. In this respect, GTS is similar to the standard inverse document frequency (idf) measure except that term frequency is measured over syntactic units rather than document size units.¹³ Terms with higher GTS values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula:

$$GTS(w) = \begin{cases} IC_L(w) \cdot IC_R(w) & \text{if both exist} \\ IC_R(w) & \text{if only } IC_R(w) \text{ exists} \\ 0 & \text{otherwise} \end{cases}$$


where (with $n_w, d_w > 0$):

$$IC_L(w) = IC([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}$$

$$IC_R(w) = IC([\_,w]) = \frac{n_w}{d_w (n_w + d_w - 1)}$$


For any two terms $w_1$ and $w_2$, and a constant $\delta > 1$, if $GTS(w_2) \geq \delta \cdot GTS(w_1)$ then $w_2$ is considered more specific than $w_1$. In addition, if


¹³ We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.
$$SIM_{norm}(w_1, w_2) = \sigma > \theta,$$

where $\theta$ is an empirically established threshold, then $w_2$ can be added to the query containing term $w_1$ with weight $\sigma$.¹⁴ For example, the following were obtained from the TREC WSJ training database:

GTS(child)  = 0.000001
GTS(baby)   = 0.000013
GTS(infant) = 0.000055

with

SIM(child, infant) = 0.131381
SIM(baby, child)   = 0.183064
SIM(baby, infant)  = 0.323121

Therefore both baby and infant can be used to specialize child. With this filter, the relationship between baby and infant had to be discarded, as we are unable to tell synonymous or near synonymous relationships from those which are primarily complementary, e.g., man and woman.
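
The filter can be traced on these figures with a minimal sketch. The GTS and SIM values are the ones quoted above; delta=10 and theta=0.1 are taken from the ranges reported for the TREC-1 runs, and the function name and data layout are hypothetical.

```python
GTS = {"child": 0.000001, "baby": 0.000013, "infant": 0.000055}
SIM = {
    frozenset(("child", "infant")): 0.131381,
    frozenset(("baby", "child")): 0.183064,
    frozenset(("baby", "infant")): 0.323121,
}

def expansions(term, delta=10.0, theta=0.1):
    """Terms that may be added to a query containing `term`, with weights.

    A candidate w2 is kept only if GTS(w2) >= delta * GTS(term), i.e. it is
    more specific, and the normalized similarity sigma exceeds theta.
    """
    added = {}
    for pair, sigma in SIM.items():
        if term not in pair:
            continue
        (other,) = pair - {term}
        if GTS[other] >= delta * GTS[term] and sigma > theta:
            added[other] = sigma
    return added

child_terms = expansions("child")   # baby and infant both specialize child
baby_terms = expansions("baby")     # infant is not specific enough relative to baby
```

GTS(infant) is only about four times GTS(baby), below the smallest delta used, so the baby and infant pair is dropped even though it has the highest similarity of the three.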


SUMMARY  OF  RESULTS 

We have processed a total of 500 MBytes of articles from the Wall Street Journal section of the TREC database. Retrieval experiments involved 50 user information requests (topics) (TREC topics 51-100) consisting of several fields that included both text and user supplied keywords. A typical topic is shown below:


<top>

<head> Tipster Topic Description
<num> Number: 059
<dom> Domain: Environment
<title> Topic: Weather Related Fatalities

<desc> Description:

Document will report a type of weather event which has
directly caused at least one fatality in some location.

<narr> Narrative:

A relevant document will include the number of people
killed and injured by the weather event, as well as
reporting the type of weather event and the location
of the event.

<con> Concept(s):

1. lightning, avalanche, tornado, typhoon, hurricane,
heat, heat wave, flood, snow, rain, downpour,
blizzard, storm, freezing temperatures

2. dead, killed, fatal, death, fatality, victim

3. NOT man-made disasters, NOT war-induced famine

4. NOT earthquakes, NOT volcanic eruptions

</top>

¹⁴ For CACM-3204 collection the filter was most effective at θ = 037. For TREC-1 we changed the similarity formula slightly in order to obtain correct normalizations in all cases. This however lowered similarity coefficients in general and a new threshold had to be selected. We used θ = 0.1 in TREC-1 runs, although it turned out to be a poor choice. In all cases δ varied between 10 and 100.

Note that this topic actually consists of two different statements of the same query: the natural language specification consisting of <desc> and <narr> fields, and an expert-selected list of key terms which are often far more informative than the narrative part. Results obtained for queries using text fields only and those involving both text and keyword fields are reported separately. Further experiments have suggested that natural language processing impact is significant but may be severely limited by the expressiveness of the term-based representation. Since the <con> field is considered the expert-user’s rendering of the ‘optimal’ search query, our system is able to discover much of it from a less complete specification in the text section of the request via query expansion. In fact, we noted that the recall/precision gap between automatically generated queries and those supplied by the user was largely closed when NLP was used. Moreover, even with the keyword field included in the query along with other fields, NLP’s impact on the system’s performance is still noticeable.
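
To make the distinction between text-only and keyword-based queries concrete, the following minimal sketch splits a topic into its fields and assembles the two kinds of query text. It is not the system's actual topic reader: the function name, the regular expression, and the abbreviated topic are assumptions for illustration, and the field choices follow the run definitions in the caption of Table 4.

```python
import re

# Abbreviated copy of topic 059, for illustration only.
TOPIC = """<top>
<head> Tipster Topic Description
<num> Number: 059
<dom> Domain: Environment
<title> Topic: Weather Related Fatalities
<desc> Description:
Document will report a type of weather event which has
directly caused at least one fatality in some location.
<narr> Narrative:
A relevant document will include the number of people
killed and injured by the weather event.
<con> Concept(s):
1. lightning, avalanche, tornado, typhoon, hurricane
2. dead, killed, fatal, death, fatality, victim
</top>"""

def split_fields(topic):
    """Map each field tag to its text; a field runs up to the next tag."""
    fields = {}
    for match in re.finditer(r"<(\w+)>([^<]*)", topic):
        tag = match.group(1)
        body = " ".join(match.group(2).split())  # collapse line breaks
        if body:                                 # skips the bare <top> wrapper
            fields[tag] = body
    return fields

fields = split_fields(TOPIC)
txt_query = fields["desc"] + " " + fields["narr"]   # text fields only
con_query = fields["desc"] + " " + fields["con"]    # description plus keywords
```

Terms such as tornado appear only in the keyword-based query text, which is the gap that query expansion is meant to narrow.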

Other results on the impact of different fields in TREC topics on the final recall/precision results were reported by Broglio and Croft (1993) at the ARPA HLT workshop, although text-only runs were not included. One of the most striking observations they have made is that the narrative field is entirely disposable, and moreover that its inclusion in the query actually hurts the system’s performance. It has to be pointed out, however, that they do little language processing.¹⁵

Summary statistics for these runs are shown in Table 4. These results are fairly tentative and should be regarded with some caution. For one, the column named txt reports performance of <desc> and <narr> fields which have been processed with our suffix-trimmer. This means some NLP has been done already (tagging lexicon), and therefore what we see there is not the performance of a ‘pure’ statistical system. The same applies to the con column. (For


¹⁵ Bruce Croft (personal communication, 1992) has suggested that excluding all expert-made fields (i.e., <con> and <fac>) would make the queries quite ineffective. Broglio (personal communication, 1993) confirms this, showing that text-only retrieval (i.e., with <desc> and <narr>) shows an average precision more than 30% below that of <con>-based retrieval.
word1           word2                 SIMnorm

abm             *anti+ballistic       0.534894
absence         *maternity            0.233082
accept          acquire               0.179078
accord          pact                  0.492332
acquire         purchase              0.449362
speech          address               0.263789
adjustable      one+year              0.824053
maxsaver        *advance+purchase     0.734008
affair          scandal               0.684877
affordable      low+income            0.181795
disease         *ailment              0.247382
medium+range    *air+to+air           0.874508
aircraft        *jetliner             0.166777
aircraft        plane                 0.423831
airline         carrier               0.345490
alien           immigrate             0.270412
anniversary     *bicentennial         0.588210
anti+age        anti+wrinkle          0.153918
anti+clot       cholesterol+lower     0.856712
contra          *anti+sandinista      0.294677
candidate       *aspirant             0.116025
contend         *aspirant             0.143459
property        asset                 0.285299
attempt         bid                   0.641592
await           pend                  0.572960
stealth         *b+1                  0.877582
child           *baby                 0.183064
baggage         luggage               0.607333
ban             restrict              0.321943
bearish         bullish               0.847103
bee             *honeybee             0.461023
roller+coast    *bumpy                0.898278
two+income      two+earner            0.293104
television      tv                    0.806018
soldier         troop                 0.374410
treasury        *short+term           0.661133
research        study                 0.209257
withdrawal      *pullout              0.622558

Table 1. Selected filtered word similarities (* indicates
the more specific term).


word          cluster

takeover      merge, buy-out, acquisition, bid
stock         share, issue, bond, price
staff         personnel, employee, force
share         stock, issue, fund
sensitive     crucial, difficult, critical
rumor         speculate
president     director, executive, chairman, manage
outlook       forecast, prospect, trend, picture
law           rule, legislate, bill, regulate
earnings      revenue, income
portfolio     asset, invest, loan, property, hold
inflate       growth, earnings, rise
industry      business, company, market
help          additional, support, involve
growth        increase, rise, gain, decline, earnings, profit
firm          bank, concern, group, unit
environ       climate, condition, situation, trend
debt          loan, secure, bond
custom(er)    client, investor, buyer, consume(r)
counsel       attorney
compute       machine, software
competitor    rival, partner, buyer
company       business, firm, bank, market, industry, concern
big           large, major, huge
base          facile, source, reserve, support
asset         property, loan, fund, invest, share, stock, money

Table 2. Selected clusters obtained from approx. 10^7
words of text with weighted Tanimoto formula.
comparison, see Table 3 where runs with CACM-3204 collection included ‘pure’ statistics run (base), and note the impact our suffix trimmer is having.) Nonetheless, one may notice that automated NLP can be very effective at discovering the right query from an imprecise narrative specification: as much as 82% of the effectiveness of the expert-generated query can be attained.


CONCLUSIONS 

We presented in some detail a natural language information retrieval system consisting of an advanced NLP module and a ‘pure’ statistical core engine. While many problems remain to be resolved, including the question of adequacy of term-based representation of document contents, we attempted to demonstrate that the architecture described here is nonetheless viable. In particular, we demonstrated that natural language processing can now be done on a fairly large scale and that its speed and robustness can match those of traditional statistical programs such as key-word indexing or statistical phrase extraction. We suggest, with some caution until more experiments are run, that natural language processing can be very effective in creating appropriate search queries out of a user’s initial specifications, which can be frequently imprecise or vague.

On the other hand, we must be aware of the limits of NLP technologies at our disposal. While part-of-speech tagging, lexicon-based stemming, and parsing can be done on large amounts of text (hundreds of millions of words and more), other, more advanced processing involving conceptual structuring, logical forms, etc., is still beyond reach, computationally. It may be assumed that these super-advanced techniques will prove even more effective, since they address the problem of representation-level limits; however, the experimental evidence is sparse and necessarily limited to rather small scale tests (e.g., Mauldin, 1991).
Runs         base     suff. trim   query exp.

Recall       Precision Averages
0.00         0.764    0.775        0.793
0.10         0.674    0.688        0.700
0.20         0.547    0.547        0.573
0.30         0.449    0.479        0.486
0.40         0.387    0.421        0.421
0.50         0.329    0.356        0.372
0.60         0.273    0.280        0.304
0.70         0.198    0.222        0.226
0.80         0.146    0.170        0.174
0.90         0.093    0.112        0.114
1.00         0.079    0.087        0.090

3-pt Avg.    0.328    0.356        0.371
%chg                  +8.3         +13.1

Table 3. Run statistics for CACM-3204 database:
with no NLP; with suffix trimmer;
and with both phrases and similarities.


Run          txt      txt+nlp   con      con+nlp

Queries      50       50        50       50

Tot. number of docs over all queries
Ret          9980     9980      9788     9975
Rel          6228     6228      6228     6228
RelRet       1598     1835      1927     2062
%chg                  +14.8     +20.6    +29.0

Recall       Precision Averages
0.00         0.6420   0.6917    0.7021   0.7539
0.10         0.3727   0.4194    0.4476   0.4848
0.20         0.2476   0.2959    0.3353   0.3641
0.30         0.1543   0.2150    0.2202   0.2674
0.40         0.1093   0.1513    0.1443   0.1735
0.50         0.0611   0.0959    0.0851   0.1001
0.60         0.0298   0.0396    0.0403   0.0665
0.70         0.0160   0.0175    0.0187   0.0103
0.80         0.0046   0.0047    0.0048   0.0024
0.90         0.0000   0.0027    0.0000   0.0010
1.00         0.0000   0.0000    0.0000   0.0010

Average Precisions
11-pt        0.1489   0.1758    0.1817   0.2023
%chg                  +18.0     +22.0    +35.8
3-pt         0.1044   0.1322    0.1417   0.1555
%chg                  +26.6     +35.7    +48.9
at 5         0.4360   0.5000    0.4680   0.4800
%chg                  +14.6     +7.3     +10.0
at 15        0.3453   0.3827    0.3880   0.4107
%chg                  +10.8     +12.3    +18.9
at 100       0.2108   0.2384    0.2498   0.2712
%chg                  +13.0     +18.5    +28.6

Table 4. Run statistics with TIPSTER WSJ database
with top 200 documents considered per each query:
(1) txt - with <narr> and <desc> fields only; (2)
txt+nlp - with <narr> and <desc> only including syntactic
phrase terms and similarities; (3) con - with
<desc> and <con> fields only; and (4) con+nlp - with
<desc> and <con> fields including phrases and similarities.
In all cases documents preprocessed with
lexicon-based suffix-trimmer.
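
The %chg rows in Table 4 are relative changes with respect to the txt run. As a small worked check, the sketch below recomputes them for the RelRet row (relevant documents retrieved), whose entries are exact integer counts; the function and variable names are hypothetical.

```python
def pct_change(base, value):
    """Relative change of value over base, in percent, to one decimal."""
    return round(100.0 * (value - base) / base, 1)

# RelRet row of Table 4: relevant documents retrieved over all 50 queries.
rel_ret = {"txt": 1598, "txt+nlp": 1835, "con": 1927, "con+nlp": 2062}

changes = {run: pct_change(rel_ret["txt"], count)
           for run, count in rel_ret.items() if run != "txt"}
```

For txt+nlp this is 100 * (1835 - 1598) / 1598, which rounds to 14.8, matching the table. Recomputing the %chg rows for the average-precision figures can differ in the last digit, since the printed averages are themselves rounded.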